This code-through analyzes a flight dataset and predicts flight delays using classification models.
It demonstrates how to wrangle, clean, and transform a dataset to prepare it for the models that are used.
It also introduces the models and explains their results with confusion matrices, which visualize and summarize the performance of a classification algorithm.
First, I will introduce the dataset and the variables I collected.
Second, in the exploratory data analysis section, I examine the total minutes of delay by reason for each airline, as well as the percentage of departure delays by day of the week and on weekends for each airline.
Finally, three models are applied: Logistic Regression, KNN, and Naive Bayes. From their results, predictions of flight delays will be made and the performance of the models will be analyzed.
The dataset is raw data collected from the Bureau of Transportation Statistics, which collects and publishes comprehensive transportation statistics. It contains all flights departing Atlanta airport in January 2022 and April 2022 for three selected airlines: Delta Air Lines, Spirit Airlines, and Frontier Airlines.
The dataset has 70,855 rows and 17 variables describing each flight, including carrier code, month, day, flight number, origin airport, destination airport, airplane information, departure time, arrival time, and delays (related to departure, weather, security, etc.).
Let's check the variables:
For your reference, the airline abbreviations used in the dataset are as follows:
In this section, you will see which days of the week, and weekends in particular, have the most flight delays for each airline.
First, let's add a Weekend variable to the dataset:
I used the grepl() function, which searches a string for matches of a character pattern and returns TRUE if the string contains the pattern and FALSE otherwise. Since Saturday and Sunday are the only day names that start with "S", I matched on that condition and added the Weekend variable with the following code:
dataset$Weekend <- grepl("S.+",weekdays(dataset$`Date (MM/DD/YYYY)`))
dataset %>%
mutate(Weekend=grepl("S.+",weekdays(`Date (MM/DD/YYYY)`)))
Second, let's check the sum of all delay reasons per airline.
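As a quick standalone sketch (toy weekday names, not part of the dataset pipeline), the pattern flags only the day names that begin with a capital "S":

```r
# Hypothetical weekday names to illustrate the match;
# only "Saturday" and "Sunday" start with a capital "S".
days <- c("Monday", "Thursday", "Saturday", "Sunday")
grepl("S.+", days)
# FALSE FALSE TRUE TRUE
```

Note that grepl() is case-sensitive, so the lowercase "s" in "Thursday" does not match; an anchored pattern such as "^S" would express the same intent more explicitly.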
To find this, I used the pivot_longer() function, which pivots a data frame from wide to long format so that each column containing a "Delay" reason is pivoted into its own rows. After this, I used the group_by() and summarise() functions on carrier code and delay category to get the sum of delays. Lastly, I plotted the graph with the ggplot2 package.
dataset %>%
pivot_longer(cols=contains("Delay"),
names_to='Delay Reasons',
values_to='Delays') %>%
group_by(`Carrier Code`,`Delay Reasons`)%>%
summarise(Delays = sum(Delays))%>%
ggplot(aes(x=Delays,y=`Delay Reasons`, fill=`Carrier Code` ))+
geom_col()+
labs(x="Sum of Departure Delays(mins) ",title = "Sum of Delays by Delay Reasons per Airline ")+
theme_bw()+
theme(plot.title = element_text(hjust = 0.5))+
theme(
axis.title.x = element_text(family="serif", color="black",size = 14, face = "bold"),
axis.title.y = element_text(family="serif", color="black",size = 14, face = "bold"),
plot.title = element_text(family="serif",color="black", size=16, face="bold"),
text=element_text(family="serif",size=14))+
theme(legend.text = element_text(family="serif",size = 10))+
theme(legend.title = element_text(family="serif",face = "bold",size=12))
As we can see from the graph, departure delay has the highest total minutes among all delay reasons for each airline, while security delay has the lowest. Thus, I decided to focus on departure delays for this analysis.
Next, to analyze flight departure delays, I created an is_delay variable: a flight with a departure delay of 15 minutes or more is considered delayed.
I applied the ifelse() function to create a dummy variable that takes the values 0 and 1, where 1 means delayed and 0 means not delayed.
dataset$is_delay <- ifelse(dataset$`Departure delay (Minutes)`>= 15, 1,0)
dataset %>%
mutate(is_delay=ifelse(`Departure delay (Minutes)`>= 15, 1,0))
Now, let's check the percentage of departure delays by day of the week for each airline:
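The 15-minute rule can be seen on a few made-up delay values (an illustrative sketch, not from the dataset):

```r
# Hypothetical departure delays in minutes
delays <- c(0, 14, 15, 42)
ifelse(delays >= 15, 1, 0)
# 0 0 1 1
```

A delay of exactly 15 minutes counts as delayed because the comparison is >=.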
I grouped the dataset by airline and day, and calculated the percentage of delays with the count(), group_by() and sum() functions. Then I used ggplot to visualize the percentage of departure delays by day for each airline.
dataset %>%
count(`Carrier Code`, Day, is_delay) %>%
group_by(`Carrier Code`, Day) %>%
mutate(percent_delay = (n/sum(n)*100) %>% round(1)) %>%
filter(is_delay==1) %>%
ggplot(aes(x = factor(Day), y = percent_delay, group=`Carrier Code`, color =`Carrier Code`)) +
geom_line(size=1) +
labs(title = "Percentage of Departure Delays by Day of the Week for Each Airline", y="Percentage of Delays", x="Day of the Week", caption = "**1:Sun | 2:Mon | 3:Tue | 4:Wed | 5:Thu | 6:Fri | 7:Sat**")+
theme_bw()+
theme(plot.title = element_text(hjust = 0.5))+
theme(
axis.title.x = element_text(family="serif",color = "black", size = 10, face = "bold"),
axis.title.y = element_text(family="serif",color = "black", size = 10, face = "bold"),
plot.title = element_text(family="serif",color="black", size=14, face="bold"),
text=element_text(family="serif"))+
theme(legend.text = element_text(family="serif",size = 10))+
theme(legend.title = element_text(family="serif",face = "bold",size=12))+
theme(
legend.position = c(1, .1),
legend.justification = c("right", "bottom"),
legend.box.just = "right",
legend.margin = margin(6, 6, 6, 6),
legend.background = element_rect(fill = "white", colour = "black"))
The graph shows that the highest percentage of departure delays falls on Saturday for each airline. Among the three airlines, Delta Airlines generally has the lowest delay percentage.
Lastly, let's find out which airline has the highest percentage of departure delays on weekends:
To find this, I grouped by the Carrier Code and Weekend variables and calculated the percentage of delays with the sum() function. Then I created a pie chart with ggplot().
dataset %>%
count(`Carrier Code`, Weekend, is_delay) %>%
group_by(`Carrier Code`, Weekend) %>%
mutate(percent_delay = (n/sum(n)*100) %>% round(1)) %>%
filter(is_delay==1 & Weekend==1) %>%
ggplot(aes(x = "", y = percent_delay, fill = fct_inorder(`Carrier Code`))) +
geom_col(width = 1, color = 1) +
geom_text(aes(label = paste0(percent_delay, "%")),
position = position_stack(vjust = 0.5),
size=4,family="serif") +
coord_polar(theta = "y") +
labs(title="The percentage of Flight Departure Delays on Weekends per Airline")+
scale_fill_brewer(palette = "Pastel1")+
guides(fill = guide_legend(title = "Airlines")) +
theme_void()+
theme(axis.line = element_blank(),
axis.text = element_blank(),
axis.ticks = element_blank(),
plot.title = element_text(hjust = 0.5,vjust=-1),
text=element_text(family="serif"))+
theme(legend.text = element_text(family="serif",size = 10))+
theme(legend.title = element_text(family="serif",face = "bold",size=12))
According to the pie chart, the highest percentage of departure delays on weekends is 28.4%, at Frontier Airlines, while the lowest is 19.7%, at Delta Airlines.
To predict flight delays, three methods are used: the Logistic Regression model, the K-Nearest Neighbors (KNN) model, and Naive Bayes.
Before fitting the classification models, I fit a linear model with lm() to check the candidate predictors. Then I applied the summary() function to interpret the most important statistical values for the analysis.
correlation <-
lm(formula = `Departure delay (Minutes)` ~ Month + Day + Weekend + `Flight Number`, data=dataset)
summary(correlation)
##
## Call:
## lm(formula = `Departure delay (Minutes)` ~ Month + Day + Weekend +
## `Flight Number`, data = dataset)
##
## Residuals:
## Min 1Q Median 3Q Max
## -43.40 -13.35 -9.79 -2.04 1046.10
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.7055366 0.5394328 6.869 0.0000000000065039 ***
## Month 2.1479429 0.1260504 17.040 < 0.0000000000000002 ***
## Day 0.6157975 0.0713288 8.633 < 0.0000000000000002 ***
## WeekendTRUE 0.6683261 0.3174769 2.105 0.0353 *
## `Flight Number` -0.0012528 0.0001672 -7.492 0.0000000000000688 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 37.71 on 70850 degrees of freedom
## Multiple R-squared: 0.00624, Adjusted R-squared: 0.006184
## F-statistic: 111.2 on 4 and 70850 DF, p-value: < 0.00000000000000022
As the output above shows, the variables I will use are statistically significant.
For logistic regression I selected the following variables:
log_reg <- dataset %>%
select(`Carrier Code`, `Date (MM/DD/YYYY)`, Month, Day, Weekend, `Flight Number`, is_delay)
head(log_reg)
log_reg$is_delay <- as.factor(log_reg$is_delay)
The target variable, is_delay, is converted into a factor for the analysis.
First, the dataset is split into a train set (70%) and a test set (30%) with the following code:
#Splitting
set.seed(1234)
index<-createDataPartition(log_reg$is_delay,p=0.7,list=FALSE)
train<- log_reg[index,]
test<- log_reg[-index,]
table(train$is_delay)
##
## 0 1
## 40316 9283
set.seed() makes the random split reproducible. The createDataPartition() function splits the data into a 70% train set and a 30% test set, which are assigned the names train and test.
The sample sizes are checked with the table() function; there are 40,316 non-delays and 9,283 delays.
Next, to address the class imbalance between delays and non-delays, the sample sizes are balanced with the downSample() function, which shrinks the majority class to the size of the minority class by random sampling.
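Under the hood, downsampling amounts to drawing a random sample of the majority class the same size as the minority class. A minimal base-R sketch of the idea (illustrative toy data, not caret's actual implementation):

```r
set.seed(111)
y <- factor(c(rep(0, 100), rep(1, 30)))           # imbalanced toy classes
n_min <- min(table(y))                            # size of the minority class
keep <- unlist(lapply(levels(y), function(lv)
  sample(which(y == lv), n_min)))                 # n_min rows from each class
table(y[keep])                                    # 30 of each class
```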
#Downsample
set.seed(111)
traindown<-downSample(x=train[,-ncol(train)],
y=train$is_delay)
table(traindown$Class)
##
## 0 1
## 9283 9283
As we can see, there are now 9,283 delays and 9,283 non-delays.
Then, the logistic regression model is fit with the dependent and independent variables and assigned the name modeldown with the following code:
# Model
glm(Class ~ `Carrier Code` + Month + Day + Weekend + `Flight Number`, data=traindown, family=binomial(link = "logit"))
##
## Call: glm(formula = Class ~ `Carrier Code` + Month + Day + Weekend +
## `Flight Number`, family = binomial(link = "logit"), data = traindown)
##
## Coefficients:
## (Intercept) `Carrier Code`F9 `Carrier Code`NK Month
## -0.55370332 0.64325330 0.55576311 0.19834226
## Day WeekendTRUE `Flight Number`
## 0.03119752 0.13448737 -0.00009894
##
## Degrees of Freedom: 18565 Total (i.e. Null); 18559 Residual
## Null Deviance: 25740
## Residual Deviance: 25330 AIC: 25340
modeldown <- glm(Class ~ `Carrier Code` + Month + Day + Weekend + `Flight Number`, data=traindown, family=binomial(link = "logit"))
After fitting the model, the predict() function is used to predict flight departure delays, and the threshold is set at 0.5: a flight is classified as delayed when its predicted probability exceeds 0.5.
# Predicting
predicted <- predict(modeldown, test, type="response")
#Threshold
predicted <-ifelse(predicted> 0.5,1,0)
predicted <- as.factor(predicted)
The "predicted" values are converted to factors for the analysis.
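The thresholding step can be seen on a few illustrative probabilities (a sketch; the values are made up):

```r
# Hypothetical predicted probabilities of delay
probs <- c(0.20, 0.50, 0.81)
ifelse(probs > 0.5, 1, 0)
# 0 0 1
```

With the strict > comparison used here, a probability of exactly 0.5 is classified as non-delay.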
Lastly, the confusion matrix is created with the following code:
confusionMatrix(predicted, test$is_delay)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 9693 1837
## 1 7585 2141
##
## Accuracy : 0.5567
## 95% CI : (0.55, 0.5634)
## No Information Rate : 0.8129
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.0638
##
## Mcnemar's Test P-Value : <0.0000000000000002
##
## Sensitivity : 0.5610
## Specificity : 0.5382
## Pos Pred Value : 0.8407
## Neg Pred Value : 0.2201
## Prevalence : 0.8129
## Detection Rate : 0.4560
## Detection Prevalence : 0.5424
## Balanced Accuracy : 0.5496
##
## 'Positive' Class : 0
##
The confusion matrix is explained in the Findings section.
For KNN I selected the following variables:
knn <- dataset %>%
select(`Carrier Code`, Month, Day, Weekend, `Flight Number`, is_delay)
head(knn)
knn$is_delay <- as.factor(knn$is_delay)
knn$Weekend <- as.numeric(knn$Weekend)
knn$`Carrier Code` <- as.numeric(as.factor(knn$`Carrier Code`))
The target variable, is_delay, is converted into a factor, and the Weekend and Carrier Code variables are converted to numeric, since KNN works with numeric predictors.
First, the dataset is split into a train set (70%) and a test set (30%) with the following code:
#Splitting
set.seed(1234)
to_take <- floor(0.70* nrow(knn))
set.seed(111)
train_idx <- sample(seq_len(nrow(knn)), size = to_take)
train2 <- knn[train_idx, ]
test2 <- knn[-train_idx, ]
table(train2$is_delay)
##
## 0 1
## 40353 9245
train2$`Carrier Code` <- as.numeric(as.factor(train2$`Carrier Code`))
test2$`Carrier Code` <- as.numeric(as.factor(test2$`Carrier Code`))
train_scale <- scale(train2[, 1:5])
test_scale <- scale(test2[, 1:5])
Here set.seed() makes the split reproducible, and sample() draws 70% of the row indices for the train set, leaving the remaining 30% for the test set. The five predictor columns are then standardized with scale().
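scale() standardizes each column to mean 0 and standard deviation 1, which keeps variables measured on very different scales (e.g. Day vs. Flight Number) from dominating the KNN distance. A quick sketch on a toy vector:

```r
x <- c(1, 2, 3, 4)
z <- as.vector(scale(x))   # center to mean 0, scale to sd 1
c(mean(z), sd(z))          # approximately 0, and 1
```

One caveat: here the test set is scaled with its own statistics; strictly speaking, test data is usually scaled using the training set's means and standard deviations.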
The sample sizes are checked with the table() function; there are 40,353 non-delays and 9,245 delays.
Next, to address the class imbalance between delays and non-delays, the sample sizes are balanced with the downSample() function, which shrinks the majority class to the size of the minority class by random sampling.
#Downsample
set.seed(111)
traindown2<-downSample(x=train_scale[,-ncol(train_scale)],
y=train2$is_delay)
table(traindown2$Class)
##
## 0 1
## 9245 9245
As we can see, there are 9,245 delays and 9,245 non-delays.
Then, the KNN model is applied with the dependent and independent variables and assigned the name classifier_knn with the following code:
# Model
classifier_knn <- knn(train = traindown2,
test = test_scale,
cl = traindown2$Class,
k = 1)
After the model, the confusion matrix is created with the following code:
#Confusion Matrix
confusionMatrix(classifier_knn, test2$is_delay)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 10046 2536
## 1 7195 1480
##
## Accuracy : 0.5422
## 95% CI : (0.5355, 0.5489)
## No Information Rate : 0.8111
## P-Value [Acc > NIR] : 1
##
## Kappa : -0.0338
##
## Mcnemar's Test P-Value : <0.0000000000000002
##
## Sensitivity : 0.5827
## Specificity : 0.3685
## Pos Pred Value : 0.7984
## Neg Pred Value : 0.1706
## Prevalence : 0.8111
## Detection Rate : 0.4726
## Detection Prevalence : 0.5919
## Balanced Accuracy : 0.4756
##
## 'Positive' Class : 0
##
The confusion matrix is explained in the Findings section.
For Naive Bayes I selected the following variables:
NB <- dataset %>%
select(`Carrier Code`, Month, Day, Weekend, `Flight Number`, is_delay)
NB$is_delay <- as.factor(NB$is_delay)
NB$Weekend <- as.numeric(NB$Weekend)
NB$`Carrier Code` = as.numeric(as.factor(NB$`Carrier Code`))
head(NB)
The target variable, is_delay, is converted into a factor, and the Weekend and Carrier Code variables are converted to numeric for the analysis.
First, the dataset is split into a train set (70%) and a test set (30%) with the following code:
#Splitting
set.seed(1234)
sample <- sample(c(TRUE, FALSE), nrow(NB), replace=TRUE, prob=c(0.7,0.3))
train3 <- NB[sample, ]
test3 <- NB[!sample, ]
table(train3$is_delay)
##
## 0 1
## 40162 9299
set.seed() makes the split reproducible, and sample() assigns each row TRUE with probability 0.7 (train set) or FALSE with probability 0.3 (test set).
The sample sizes are checked with the table() function; there are 40,162 non-delays and 9,299 delays.
Next, to address the class imbalance between delays and non-delays, the sample sizes are balanced with the downSample() function, which shrinks the majority class to the size of the minority class by random sampling.
#Downsample
set.seed(111)
traindown3<-downSample(x=train3[,-ncol(train3)],
y=train3$is_delay)
table(traindown3$Class)
##
## 0 1
## 9299 9299
As we can see, there are 9,299 delays and 9,299 non-delays.
Then, the Naive Bayes model is applied with the dependent and independent variables and assigned the name classifier_cl with the following code:
# Fitting Naive Bayes Model
# to training dataset
set.seed(120) # Setting Seed
classifier_cl <- naiveBayes(Class ~ ., data = traindown3)
classifier_cl
##
## Naive Bayes Classifier for Discrete Predictors
##
## Call:
## naiveBayes.default(x = X, y = Y, laplace = laplace)
##
## A-priori probabilities:
## Y
## 0 1
## 0.5 0.5
##
## Conditional probabilities:
## Carrier Code
## Y [,1] [,2]
## 0 1.089795 0.3885904
## 1 1.148188 0.4852965
##
## Month
## Y [,1] [,2]
## 0 2.493709 1.114077
## 1 2.722766 1.117206
##
## Day
## Y [,1] [,2]
## 0 3.988493 1.974939
## 1 4.083020 2.065498
##
## Weekend
## Y [,1] [,2]
## 0 0.2683084 0.4431028
## 1 0.2990644 0.4578727
##
## Flight Number
## Y [,1] [,2]
## 0 1730.418 857.7741
## 1 1656.907 848.7378
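In the output above, the [,1] and [,2] columns are the per-class mean and standard deviation that the Gaussian Naive Bayes model uses for each numeric predictor. As a sketch of how these parameters drive the classification, the printed Month parameters imply that a flight in month 4 is somewhat more likely under the delay class (Y = 1) than the non-delay class (evaluated with dnorm(), using the values shown):

```r
# Per-class Gaussian parameters for Month, copied from the model output
like_0 <- dnorm(4, mean = 2.493709, sd = 1.114077)  # class 0 (non-delay)
like_1 <- dnorm(4, mean = 2.722766, sd = 1.117206)  # class 1 (delay)
like_1 > like_0   # TRUE: month 4 is closer to the delay-class mean
```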
After the model, the predict() function is applied to the test dataset to predict the flight departure delays:
# Predicting on test dataset
y_pred <- predict(classifier_cl, newdata = test3)
Lastly, the confusion matrix is created with the following code:
# Confusion Matrix
cm3 <- table(y_pred, test3$is_delay)
# Model Evaluation
cm3 <- confusionMatrix(cm3)
cm3
## Confusion Matrix and Statistics
##
##
## y_pred 0 1
## 0 13794 2832
## 1 3638 1130
##
## Accuracy : 0.6976
## 95% CI : (0.6914, 0.7037)
## No Information Rate : 0.8148
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.0709
##
## Mcnemar's Test P-Value : <0.0000000000000002
##
## Sensitivity : 0.7913
## Specificity : 0.2852
## Pos Pred Value : 0.8297
## Neg Pred Value : 0.2370
## Prevalence : 0.8148
## Detection Rate : 0.6448
## Detection Prevalence : 0.7771
## Balanced Accuracy : 0.5383
##
## 'Positive' Class : 0
##
The confusion matrix is explained in the Findings section.
#Confusion Matrix
cm <- confusionMatrix(predicted, test$is_delay)
draw_confusion_matrix <- function(cm) {
total <- sum(cm$table)
res <- as.numeric(cm$table)
# Generate color gradients. Palettes come from RColorBrewer.
greenPalette <- c("#F7FCF5","#E5F5E0","#C7E9C0","#A1D99B","#74C476","#41AB5D","#238B45","#006D2C","#00441B")
redPalette <- c("#FFF5F0","#FEE0D2","#FCBBA1","#FC9272","#FB6A4A","#EF3B2C","#CB181D","#A50F15","#67000D")
getColor <- function (greenOrRed = "green", amount = 0) {
if (amount == 0)
return("#FFFFFF")
palette <- greenPalette
if (greenOrRed == "red")
palette <- redPalette
colorRampPalette(palette)(100)[10 + ceiling(90 * amount / total)]
}
# set the basic layout
layout(matrix(c(1,1,2)))
par(mar=c(2,2,2,2))
plot(c(100, 345), c(300, 450), type = "n", xlab="", ylab="", xaxt='n', yaxt='n')
title('CONFUSION MATRIX - Logistic Regression', cex.main=2)
# create the matrix
classes = colnames(cm$table)
rect(150, 430, 240, 370, col=getColor("green", res[1]))
text(195, 435, classes[1], cex=1.2)
rect(250, 430, 340, 370, col=getColor("red", res[3]))
text(295, 435, classes[2], cex=1.2)
text(125, 370, 'Predicted', cex=1.3, srt=90, font=2)
text(245, 450, 'Actual', cex=1.3, font=2)
rect(150, 305, 240, 365, col=getColor("red", res[2]))
rect(250, 305, 340, 365, col=getColor("green", res[4]))
text(140, 400, classes[1], cex=1.2, srt=90)
text(140, 335, classes[2], cex=1.2, srt=90)
# add in the cm results
text(195, 400, res[1], cex=1.6, font=2, col='white')
text(195, 335, res[2], cex=1.6, font=2, col='white')
text(295, 400, res[3], cex=1.6, font=2, col='white')
text(295, 335, res[4], cex=1.6, font=2, col='white')
# add in the specifics
plot(c(100, 0), c(100, 0), type = "n", xlab="", ylab="", main = "DETAILS", xaxt='n', yaxt='n')
text(10, 85, names(cm$byClass[1]), cex=1.2, font=2)
text(10, 70, round(as.numeric(cm$byClass[1]), 3), cex=1.2)
text(30, 85, names(cm$byClass[2]), cex=1.2, font=2)
text(30, 70, round(as.numeric(cm$byClass[2]), 3), cex=1.2)
text(50, 85, names(cm$byClass[5]), cex=1.2, font=2)
text(50, 70, round(as.numeric(cm$byClass[5]), 3), cex=1.2)
text(70, 85, names(cm$byClass[6]), cex=1.2, font=2)
text(70, 70, round(as.numeric(cm$byClass[6]), 3), cex=1.2)
text(90, 85, names(cm$byClass[7]), cex=1.2, font=2)
text(90, 70, round(as.numeric(cm$byClass[7]), 3), cex=1.2)
# add in the accuracy information
text(30, 35, names(cm$overall[1]), cex=1.5, font=2)
text(30, 20, round(as.numeric(cm$overall[1]), 3), cex=1.4)
text(70, 35, names(cm$overall[2]), cex=1.5, font=2)
text(70, 20, round(as.numeric(cm$overall[2]), 3), cex=1.4)
}
draw_confusion_matrix(cm)
Precision (Pos Pred Value) shows how many of the flights predicted to be in the positive class actually belong to it; sensitivity shows how many of the actual positives the model identified; specificity shows how many of the actual negatives were correctly classified. Note that in these confusion matrices the "Positive" class is 0 (non-delay), so precision and sensitivity refer to non-delayed flights, while specificity refers to delayed flights.
According to the confusion matrix of the logistic regression, the model's accuracy is 56%, which is mediocre. 84% of the flights predicted as non-delayed were in fact non-delayed, 56% of the actual non-delays were identified, and 54% of the actual delays were correctly classified.
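The headline numbers can be verified by hand from the printed confusion matrix (treating class 0, non-delay, as the positive class, as caret does here):

```r
# Cell counts from the logistic regression confusion matrix above
TP <- 9693; FP <- 1837   # predicted 0: actually 0 / actually 1
FN <- 7585; TN <- 2141   # predicted 1: actually 0 / actually 1
accuracy    <- (TP + TN) / (TP + FP + FN + TN)  # 0.5567
sensitivity <- TP / (TP + FN)                   # 0.5610
specificity <- TN / (TN + FP)                   # 0.5382
precision   <- TP / (TP + FP)                   # 0.8407 (Pos Pred Value)
round(c(accuracy, sensitivity, specificity, precision), 4)
```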
cm2 <- confusionMatrix(classifier_knn, test2$is_delay)
draw_confusion_matrix <- function(cm2) {
total <- sum(cm2$table)
res <- as.numeric(cm2$table)
# Generate color gradients. Palettes come from RColorBrewer.
greenPalette <- c("#F7FCF5","#E5F5E0","#C7E9C0","#A1D99B","#74C476","#41AB5D","#238B45","#006D2C","#00441B")
redPalette <- c("#FFF5F0","#FEE0D2","#FCBBA1","#FC9272","#FB6A4A","#EF3B2C","#CB181D","#A50F15","#67000D")
getColor <- function (greenOrRed = "green", amount = 0) {
if (amount == 0)
return("#FFFFFF")
palette <- greenPalette
if (greenOrRed == "red")
palette <- redPalette
colorRampPalette(palette)(100)[10 + ceiling(90 * amount / total)]
}
# set the basic layout
layout(matrix(c(1,1,2)))
par(mar=c(2,2,2,2))
plot(c(100, 345), c(300, 450), type = "n", xlab="", ylab="", xaxt='n', yaxt='n')
title('CONFUSION MATRIX - KNN', cex.main=2)
# create the matrix
classes = colnames(cm2$table)
rect(150, 430, 240, 370, col=getColor("green", res[1]))
text(195, 435, classes[1], cex=1.2)
rect(250, 430, 340, 370, col=getColor("red", res[3]))
text(295, 435, classes[2], cex=1.2)
text(125, 370, 'Predicted', cex=1.3, srt=90, font=2)
text(245, 450, 'Actual', cex=1.3, font=2)
rect(150, 305, 240, 365, col=getColor("red", res[2]))
rect(250, 305, 340, 365, col=getColor("green", res[4]))
text(140, 400, classes[1], cex=1.2, srt=90)
text(140, 335, classes[2], cex=1.2, srt=90)
# add in the cm2 results
text(195, 400, res[1], cex=1.6, font=2, col='white')
text(195, 335, res[2], cex=1.6, font=2, col='white')
text(295, 400, res[3], cex=1.6, font=2, col='white')
text(295, 335, res[4], cex=1.6, font=2, col='white')
# add in the specifics
plot(c(100, 0), c(100, 0), type = "n", xlab="", ylab="", main = "DETAILS", xaxt='n', yaxt='n')
text(10, 85, names(cm2$byClass[1]), cex=1.2, font=2)
text(10, 70, round(as.numeric(cm2$byClass[1]), 3), cex=1.2)
text(30, 85, names(cm2$byClass[2]), cex=1.2, font=2)
text(30, 70, round(as.numeric(cm2$byClass[2]), 3), cex=1.2)
text(50, 85, names(cm2$byClass[5]), cex=1.2, font=2)
text(50, 70, round(as.numeric(cm2$byClass[5]), 3), cex=1.2)
text(70, 85, names(cm2$byClass[6]), cex=1.2, font=2)
text(70, 70, round(as.numeric(cm2$byClass[6]), 3), cex=1.2)
text(90, 85, names(cm2$byClass[7]), cex=1.2, font=2)
text(90, 70, round(as.numeric(cm2$byClass[7]), 3), cex=1.2)
# add in the accuracy information
text(30, 35, names(cm2$overall[1]), cex=1.5, font=2)
text(30, 20, round(as.numeric(cm2$overall[1]), 3), cex=1.4)
text(70, 35, names(cm2$overall[2]), cex=1.5, font=2)
text(70, 20, round(as.numeric(cm2$overall[2]), 3), cex=1.4)
}
draw_confusion_matrix(cm2)
According to the confusion matrix of KNN, the model's accuracy is 54%, which is also mediocre. 80% of the flights predicted as non-delayed were in fact non-delayed, and 58% of the actual non-delays were identified, but only 37% of the actual delays were correctly classified.
draw_confusion_matrix <- function(cm3) {
total <- sum(cm3$table)
res <- as.numeric(cm3$table)
# Generate color gradients. Palettes come from RColorBrewer.
greenPalette <- c("#F7FCF5","#E5F5E0","#C7E9C0","#A1D99B","#74C476","#41AB5D","#238B45","#006D2C","#00441B")
redPalette <- c("#FFF5F0","#FEE0D2","#FCBBA1","#FC9272","#FB6A4A","#EF3B2C","#CB181D","#A50F15","#67000D")
getColor <- function (greenOrRed = "green", amount = 0) {
if (amount == 0)
return("#FFFFFF")
palette <- greenPalette
if (greenOrRed == "red")
palette <- redPalette
colorRampPalette(palette)(100)[10 + ceiling(90 * amount / total)]
}
# set the basic layout
layout(matrix(c(1,1,2)))
par(mar=c(2,2,2,2))
plot(c(100, 345), c(300, 450), type = "n", xlab="", ylab="", xaxt='n', yaxt='n')
title('CONFUSION MATRIX - Naive Bayes', cex.main=2)
# create the matrix
classes = colnames(cm3$table)
rect(150, 430, 240, 370, col=getColor("green", res[1]))
text(195, 435, classes[1], cex=1.2)
rect(250, 430, 340, 370, col=getColor("red", res[3]))
text(295, 435, classes[2], cex=1.2)
text(125, 370, 'Predicted', cex=1.3, srt=90, font=2)
text(245, 450, 'Actual', cex=1.3, font=2)
rect(150, 305, 240, 365, col=getColor("red", res[2]))
rect(250, 305, 340, 365, col=getColor("green", res[4]))
text(140, 400, classes[1], cex=1.2, srt=90)
text(140, 335, classes[2], cex=1.2, srt=90)
# add in the cm3 results
text(195, 400, res[1], cex=1.6, font=2, col='white')
text(195, 335, res[2], cex=1.6, font=2, col='white')
text(295, 400, res[3], cex=1.6, font=2, col='white')
text(295, 335, res[4], cex=1.6, font=2, col='white')
# add in the specifics
plot(c(100, 0), c(100, 0), type = "n", xlab="", ylab="", main = "DETAILS", xaxt='n', yaxt='n')
text(10, 85, names(cm3$byClass[1]), cex=1.2, font=2)
text(10, 70, round(as.numeric(cm3$byClass[1]), 3), cex=1.2)
text(30, 85, names(cm3$byClass[2]), cex=1.2, font=2)
text(30, 70, round(as.numeric(cm3$byClass[2]), 3), cex=1.2)
text(50, 85, names(cm3$byClass[5]), cex=1.2, font=2)
text(50, 70, round(as.numeric(cm3$byClass[5]), 3), cex=1.2)
text(70, 85, names(cm3$byClass[6]), cex=1.2, font=2)
text(70, 70, round(as.numeric(cm3$byClass[6]), 3), cex=1.2)
text(90, 85, names(cm3$byClass[7]), cex=1.2, font=2)
text(90, 70, round(as.numeric(cm3$byClass[7]), 3), cex=1.2)
# add in the accuracy information
text(30, 35, names(cm3$overall[1]), cex=1.5, font=2)
text(30, 20, round(as.numeric(cm3$overall[1]), 3), cex=1.4)
text(70, 35, names(cm3$overall[2]), cex=1.5, font=2)
text(70, 20, round(as.numeric(cm3$overall[2]), 3), cex=1.4)
}
draw_confusion_matrix(cm3)
According to the confusion matrix of Naive Bayes, the model's accuracy is 70%. 83% of the flights predicted as non-delayed were in fact non-delayed, and 79% of the actual non-delays were identified. That is good! However, the specificity is only 29%, which is low: the model misses most of the actual delays.
According to the analysis in the Exploratory Data Analysis section, Delta Airlines has the lowest percentage of departure delays, and for each airline the highest percentage of departure delays falls on Saturday.
Overall, among the three models, Naive Bayes is the most accurate at 70%, but its specificity is low, so it misses most actual delays. Between KNN and logistic regression, logistic regression has the better precision and accuracy.
Learn more about applying these models in R with the following resources:
Resource I Logistic Regression
Resource II KNN
Resource III Naive Bayes
This code-through references and cites the following sources:
Modeling with R (2019). Source I. DownSample
Cybernetic (2017). Source II. Confusion Matrix
Henderson, C. (2021). Source III. Flight Delay Analysis